16S Metagenomics by UNCODED using Illumina Next-Generation Sequencing

Krushna Chandra Murmu

Premas Life Sciences

5/2/2026

Greetings and Introduction to the workshop

  • Introduction to 16S rRNA Metagenomic Data Analysis

  • Analysis Workflow

  • Basics of WSL and R

Workshop data and materials

Overall Agenda

  1. Expected Learning outcomes
  2. Basics of data analysis
  3. Small exercise

Expected Learning outcomes:

After taking this workshop, participants should be able to:

  1. Gain a basic working knowledge of Linux.

  2. Configure and install tools to set up an analysis environment.

  3. Handle raw sequencing files, generate FASTQ data, and perform basic exploratory and statistical analysis.

What is the basic purpose of studying the microbiome?

  • Who are they?
    (species identification; what genes they contain)

  • Where do they come from?

  • Are there similarities:

    • between communities of different conditions?
    • of similar conditions?
    • within a community?
    • over a time period?
  • What are they doing?

  • How are they doing?
    (factors influencing the community)

Experimental design and the workflow

Environmental DNA metabarcoding: Transforming how we survey animal and plant communities

Introduction to 16S rRNA Metagenomic Data Analysis

The workflow in statistical terms

Software needed for analysis

  • bcl2fastq or bclconvert (Linux OS or WSL)

  • R and RStudio

Introduction to CLI

What is a shell?

A shell is a computer program that presents a command line interface, allowing you to control your computer using commands entered with a keyboard.

How to access the shell - Windows

  • Command Prompt: Not useful for bioinformatics

  • PowerShell : Can run Conda/CLI tools

  • WSL

  • https://sandbox.bio/

How to access the shell – Linux inside Windows?

  • Windows introduced a Linux compatibility layer which allows running Linux inside Windows.

  • Install from Windows Store

Terminal and shell

  • Terminal: A program that runs a shell

  • Shell: A program that interprets commands

  • Bash: The most common shell on Linux

  • zsh: The most common shell on MacOS

  • PowerShell: The most common shell on Windows

  • Cmd: The legacy shell on Windows

Why do we need CLI ?

  • Fast and efficient way to interact with your computer

  • Important part of your automation toolbox to create a reproducible data analysis pipeline

  • Accessing a remote server almost always requires some sort of command line skills

Basic commands and their usage

Description            Windows  Linux, macOS
Copy files, folders    copy     cp
Move files, folders    move     mv
List folder content    dir      ls
Create new folder      mkdir    mkdir
Change current folder  cd       cd
Show current path      cd       pwd
Locate a program       where    which
Delete file(s)         del      rm
  • pwd

  • ls

  • ls -lh

  • cd 16s_workshop

  • cd ..

  • cd ~

  • mkdir data

  • mkdir images

  • mkdir results

  • cp sample_R1.fastq.gz data/

  • mv sample_R1.fastq.gz sample1_R1.fastq.gz

  • head metadata.tsv

  • grep -n "control" metadata.tsv

  • ls -lh data/*.fastq.gz

  • zcat data/sample1_R1.fastq.gz | head -n 12

  • zcat data/sample1_R1.fastq.gz | wc -l

  • touch results/notes.txt

Working With Compressed Files

Suffix   Compress command  Extract command  Useful arguments  What it does
.zip     zip               unzip            -d, -c, -f        Windows-style archive; keeps multiple files together
.gz      gzip              gunzip           -d, -c, -f        Compresses a single file
.gz      —                 zcat             —                 View a .gz file without extracting it
.tar.gz  tar -czf          tar -xzf         -x, -v, -f, -z    Linux archive (multiple files + compression)
.bz2     bzip2             bunzip2          -d, -k, -f        Like gzip but with higher compression
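R (used later in this workshop) can also read gzip-compressed files without extracting them, much like zcat does on the shell. A minimal sketch using a temporary file (the file name and contents are made up for illustration):

```r
# Write a couple of lines through a gzip connection
path <- tempfile(fileext = ".txt.gz")
con <- gzfile(path, "w")
writeLines(c("@read1", "ACGT"), con)
close(con)

# readLines() on a gzfile connection decompresses on the fly,
# just like `zcat file.gz | head` on the command line
lines <- readLines(gzfile(path))
lines
# [1] "@read1" "ACGT"
```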

Working with paths

Relative vs. Absolute Paths

A relative path specifies a location starting from a “fixed location” on your computer, rather than from the root of the filesystem.

Often, this “fixed location” is the so-called “working directory”.

  • The dot . denotes the current working directory

  • The dot dot .. denotes the parent directory, i.e., it points upwards in the folder hierarchy

  • Finally, the tilde symbol ~ will bring you back to your home directory, e.g. cd ~

subfolder/file.txt # Inside a subfolder 

./file.txt # Current directory 

../file.txt # Parent directory

How do we get fastq files out of raw files ?

Demultiplexing

  • BCL output folder or RUN folder

  • Samplesheet

  • bcl2fastq or bclconvert

Setting up environment with conda

#https://docs.conda.io/en/latest/miniconda.html

#bash Miniconda3-latest-*.sh

#conda init

#conda create -n myenv
#conda activate myenv

#conda install -c bioconda bcl2fastq-nextseq
#conda install -c bioconda fastqc

Command : How to perform demux

#bcl2fastq --runfolder-dir /mnt/d/250114_VH01821_3_AAGHCCJM5 --output-dir /mnt/d/demugs --sample-sheet /mnt/d/250114_VH01821_3_AAGHCCJM5/samplesheet.csv

How to check QC of fastq files

Software required

  • fastqc

  • multiqc

QC check and phred score

Q = -10 x log10(P), where P is the probability that a base call is erroneous 

Phred Quality Score interpretation

Phred Quality Score Probability of incorrect base call Base call accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10,000 99.99%
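The formula above can be checked directly in R; a minimal sketch (the function names are my own, not from any package):

```r
# Phred score from error probability: Q = -10 * log10(P)
phred_from_p <- function(p) -10 * log10(p)

# And back: P = 10^(-Q / 10)
p_from_phred <- function(q) 10^(-q / 10)

phred_from_p(0.001)   # a 1-in-1000 error rate is Q30
# [1] 30

p_from_phred(20)      # Q20 means a 1-in-100 chance the base call is wrong
# [1] 0.01
```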

Running fastqc on fastq file and interpret the output

  • fastqc -h

  • fastqc -o results/ *.fastq.gz

Pre QC and Post QC of fastq files

Break

Loading R Session…

Introduction to R

R packages and bioconductor

R basics and getting started with RStudio

RStudio layout

Data Management

  • Saving your R environment (.RData)

  • Navigating directories

    • getwd()
  • Set working directories

    • setwd()

Basics of R Programming

To understand some of the most basic features of the R language including:

  • Creating R objects and understanding object types

  • Using mathematical operations

  • Using comparison operators

  • Creating, subsetting, and modifying vectors

Creating and deleting objects

# object holding a gene name
gene_name <- "HDAC9"
gene_name
[1] "HDAC9"

# Delete the object
rm(gene_name)

Object data types

gene_name <- "CHD1"
gene_name
[1] "CHD1"
class(gene_name)
[1] "character"

Mathematical operations

Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
a%/%b integer division (division where the remainder is discarded)
a%%b modulus (returns the remainder after division)

Exercise: create an object storing your own age, perform some mathematical operations on it, and print the result.
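One possible solution to the exercise above (the age value is an assumption for illustration):

```r
# Store an age (assumed value) in an object
my_age <- 30

# Age in months, using multiplication
my_age * 12
# [1] 360

# Years until 100, using subtraction
100 - my_age
# [1] 70

# Modulus: the remainder after dividing by 2 (0 means the age is even)
my_age %% 2
# [1] 0
```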

Logical subsetting

Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x not x
a | b a or b
a & b a and b
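These operators return TRUE/FALSE vectors that can be used to subset data. A minimal sketch with a made-up named vector of gene counts:

```r
# A named numeric vector of (made-up) gene counts
expr <- c(HDAC9 = 5, CHD1 = 0, TP53 = 12, BRCA1 = 3)

expr > 3          # element-wise comparison gives a logical vector
# HDAC9  CHD1  TP53 BRCA1
#  TRUE FALSE  TRUE FALSE

expr[expr > 3]    # keep only the elements where the comparison is TRUE
# HDAC9  TP53
#     5    12

expr[expr > 0 & expr < 10]   # combine conditions with &
# HDAC9 BRCA1
#     5     3
```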

R Data Structures: different classes of data

  • factors

  • lists

  • data frames

  • matrices

Create a factor

gender <- factor(c("M","F","F","M","M","M"))
levels(gender)
[1] "F" "M"

Create a list

My_bacteria <- list(c("Corynebacterium jeikeium", "Acinetobacter calcoaceticus"),
 c("SRR1039508", "SRR1039509", "SRR1039512",
 "SRR1039513", "SRR1039516", "SRR1039517",
 "SRR1039520", "SRR1039521"),c(100,200,300,400))
#Look at the structure
str(My_bacteria)
List of 3
 $ : chr [1:2] "Corynebacterium jeikeium" "Acinetobacter calcoaceticus"
 $ : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
 $ : num [1:4] 100 200 300 400

Working with data frames

R provides several functions for importing tabular data, including read.csv(), read.table(), and read.delim().
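These functions differ mainly in their defaults: read.csv() assumes comma-separated values with a header, read.delim() assumes tab-separated, and both are thin wrappers around read.table(). A minimal sketch using a temporary file with made-up contents:

```r
# Write a tiny CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("gene,count", "HDAC9,5", "CHD1,0"), csv_path)

# read.csv() uses header = TRUE and sep = "," by default
df <- read.csv(csv_path)
df
#    gene count
# 1 HDAC9     5
# 2  CHD1     0

# Equivalent explicit call with read.table()
df2 <- read.table(csv_path, header = TRUE, sep = ",")
identical(df, df2)
```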

Working with libraries and data wrangling

Load the data and filter it based on thresholds, counts, and metadata

library(dplyr)
counts <- read.table("data/Data_for_R_analysis/airway_counts.txt",
                     header = TRUE,
                     sep = "\t",
                     row.names = 1)
meta <- read.table("data/Data_for_R_analysis/airway_metadata.txt",
                   header = TRUE,
                   sep = "\t")

Functions to test

nrow(counts)
[1] 63677
ncol(counts) 
[1] 8
head(counts) 
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003        679        448        873        408       1138
ENSG00000000005          0          0          0          0          0
ENSG00000000419        467        515        621        365        587
ENSG00000000457        260        211        263        164        245
ENSG00000000460         60         55         40         35         78
ENSG00000000938          0          0          2          0          1
                SRR1039517 SRR1039520 SRR1039521
ENSG00000000003       1047        770        572
ENSG00000000005          0          0          0
ENSG00000000419        799        417        508
ENSG00000000457        331        233        229
ENSG00000000460         63         76         60
ENSG00000000938          0          0          0
tail(counts)  
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000273488          7          5          8          3          8
ENSG00000273489          0          0          0          1          0
ENSG00000273490          0          0          0          0          0
ENSG00000273491          0          0          0          0          0
ENSG00000273492          0          0          1          0          0
ENSG00000273493          0          0          0          0          1
                SRR1039517 SRR1039520 SRR1039521
ENSG00000273488         16         11         14
ENSG00000273489          1          0          0
ENSG00000273490          0          0          0
ENSG00000273491          0          0          0
ENSG00000273492          0          0          0
ENSG00000273493          0          0          0
colnames(counts) 
[1] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516"
[6] "SRR1039517" "SRR1039520" "SRR1039521"
#rownames(counts) 

EDA : Exploratory data analysis exercise

Read and transform the data

counts <- read.table("data/Data_for_R_analysis/airway_counts.txt", header = TRUE, sep = "\t", row.names = 1, check.names = FALSE)
meta   <- read.table("data/Data_for_R_analysis/airway_metadata.txt", header = TRUE, sep = "\t", check.names = FALSE)


stopifnot(all(colnames(counts) %in% meta$Run))
counts <- counts[, meta$Run]  # reorder columns to match metadata

# transform
log_counts <- log2(counts + 1)

libsize <- colSums(counts)
cpm <- t( t(counts) / libsize * 1e6 )
log_cpm <- log2(cpm + 1)
dim(log_cpm)
[1] 63677     8

Subset metadata

table(meta$dex)

  trt untrt 
    4     4 
keep_samples <- meta$dex %in% c("trt", "untrt")
meta_sub <- meta[keep_samples, ]
X <- log_cpm[, meta_sub$Run]
dim(X)
[1] 63677     8

Subset genes or features with enough counts

counts_X <- counts[rownames(X), meta_sub$Run, drop = FALSE]

keep_genes2 <- rowMeans(counts_X) >= 10

Xf <- X[keep_genes2, , drop = FALSE]

dim(Xf)
[1] 16090     8

Box-plot to check the distribution

# install.packages(c("ggplot2", "pheatmap", "corrplot"))
library(ggplot2)
library(pheatmap)
library(corrplot)


df <- data.frame(
  value = as.numeric(Xf),
  sample = rep(colnames(Xf), each = nrow(Xf))
)

ggplot(df, aes(x = sample, y = value)) +
  geom_boxplot(outlier.shape = NA) +
  theme_minimal() +
  labs(
    title = "Log-CPM per sample",
    x = "Sample",
    y = "log2(CPM + 1)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Boxplot with groups

library(ggplot2)

df <- data.frame(
  value = as.numeric(Xf),
  sample = rep(colnames(Xf), each = nrow(Xf))
)

df <- merge(df, meta_sub, by.x = "sample", by.y = "Run")

ggplot(df, aes(x = cell, y = value, fill = cell)) +
  geom_boxplot(outlier.shape = NA) +
  theme_minimal() +
  labs(
    title = "Log-CPM by cell type",
    x = "Cell type",
    y = "log2(CPM + 1)"
  )

PCA : Select features

vars <- apply(Xf, 1, var)
top <- order(vars, decreasing = TRUE)[1:500]   # top 500 variable genes
X_top <- Xf[top, ]   # index the filtered matrix the variances were computed from

pca <- prcomp(t(X_top), center = TRUE, scale. = FALSE)

pca_df <- data.frame(
  Sample = rownames(pca$x),
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2]
)

pca_df <- merge(pca_df, meta_sub, by.x = "Sample", by.y = "Run")


ggplot(pca_df, aes(PC1, PC2, color = dex)) +
  geom_point(size = 4) +
  theme_minimal() +
  labs(title = "PCA (top variable genes)", x = "PC1", y = "PC2")

Correlation plot among the samples

cor_mat <- cor(X, method = "pearson")

heatmap(cor_mat,
        main = "Sample correlation heatmap",
        col = colorRampPalette(c("blue", "white", "red"))(100),
        margins = c(6, 6))

Plotting features and samples hierarchical clustering

sds <- apply(Xf, 1, sd)

# Keep only genes with variation (index the same matrix the SDs came from)
X_var <- Xf[sds > 0, , drop = FALSE]

# Pick top 50 most variable
top50 <- order(apply(X_var, 1, sd), decreasing = TRUE)[1:50]
X50  <- X_var[top50, , drop = FALSE]



library(pheatmap)

ann <- meta_sub[, c("cell", "dex"), drop = FALSE]
rownames(ann) <- meta_sub$Run
pheatmap(X50,
         scale = "row",
         annotation_col = ann,
         show_rownames = TRUE,
         show_colnames = TRUE,
         fontsize_row = 8,
         fontsize_col = 10,
         main = "Top 50 variable genes")

Microbiome analysis: Challenges and Concepts

Challenges

  • Heterogeneity

  • Poor data quality

  • Structured high-dimensionality

  • Reproducibility

Concepts

OTU vs ASV: Two Approaches to Defining Microbial Features

Aspect OTUs (Operational Taxonomic Units) ASVs (Amplicon Sequence Variants)
Definition OTUs are clusters of sequences grouped by similarity, usually at a 97% identity threshold. ASVs are exact DNA sequences inferred from the data after removing sequencing errors.
Basic idea Similar sequences are grouped into the same unit. Every unique biological sequence is kept.

How OTUs and ASVs Work

Step OTU (Clustering approach) ASV (Error-model approach)
1 Start with all sequencing reads Start with all sequencing reads
2 Compare sequences to each other Learn the sequencing error pattern
3 Measure similarity between sequences Separate real variation from noise
4 Group sequences above a similarity threshold (e.g. 97%) Infer true biological sequences
5 Merge similar sequences into one OTU Keep exact sequences (even 1 base difference)
6 Assign taxonomy to each OTU Assign taxonomy to each ASV
7 Build OTU table (counts per sample) Build ASV table (counts per sample)

Tools for OTU and ASV Workflows

Workflow type Tool Description Typical usage
OTU-based Mothur A well-established pipeline for OTU clustering and classical microbiome analysis. Legacy studies, teaching traditional pipelines
VSEARCH Open-source tool for similarity-based clustering and chimera detection. Fast OTU clustering, open-source replacement for USEARCH
USEARCH Classic OTU software (not fully open-source, but widely cited). Historical reference, many old pipelines
ASV-based DADA2 (R) Most popular ASV inference tool; builds error models and integrates with phyloseq. Gold standard in R-based workflows
Deblur (QIIME2) Fast and robust ASV inference using static error profiles. Large datasets, quick processing
QIIME 2 Modern microbiome platform supporting both OTUs (legacy) and ASVs (via DADA2/Deblur). End-to-end analysis, most widely used platform today

Benchmarking of various 16S amplicon clustering methods

Getting the data

#curl -L -o CCAMP_workshop.tar.gz https://ndownloader.figshare.com/files/28773936
#tar -xzvf CCAMP_workshop.tar.gz

Commands and function for processing and analysis of 16s data

  Step Command What we’re doing
1 cutadapt / filterAndTrim() remove primers and quality trim/filter
2 learnErrors() generate an error model of our data
3 derepFastq() dereplicate sequences
4 dada() infer ASVs on both forward and reverse reads independently
5 mergePairs() merge forward and reverse reads to further refine ASVs
6 makeSequenceTable() generate a count table
7 removeBimeraDenovo() screen for and remove chimeras
8 IdTaxa() assign taxonomy
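For orientation, the steps above map onto a DADA2 call sequence roughly like the following sketch. The folder names, file-name patterns, and truncation/filtering parameters are placeholders, not values from this workshop; in practice they are chosen after inspecting your own quality profiles.

```r
library(dada2)  # Bioconductor package

# Placeholder inputs: primer-trimmed paired-end fastq files
fnFs <- sort(list.files("trimmed", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files("trimmed", pattern = "_R2.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

# 1. Quality trim/filter (truncLen and maxEE depend on your data)
filterAndTrim(fnFs, filtFs, fnRs, filtRs,
              truncLen = c(240, 160), maxEE = c(2, 2))

# 2. Learn an error model for each read direction
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# 3-4. Dereplicate, then infer ASVs independently per direction
dadaFs <- dada(derepFastq(filtFs), err = errF, multithread = TRUE)
dadaRs <- dada(derepFastq(filtRs), err = errR, multithread = TRUE)

# 5-7. Merge pairs, build the count table, remove chimeras
merged <- mergePairs(dadaFs, derepFastq(filtFs), dadaRs, derepFastq(filtRs))
seqtab <- makeSequenceTable(merged)
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE)

# 8. Taxonomy is then assigned with IdTaxa() from DECIPHER,
#    which requires a trained classifier set
```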

Files you will get after processing with DADA2

  • a FASTA file of ASVs

  • a count table

  • a taxonomy table

DADA2 R workflow

Differential abundance workflow

How do we prepare DADA2 output for differential abundance analysis? phyloseq

Working with Synthetic data

set.seed(1)

# if (!requireNamespace("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# 
# BiocManager::install("phyloseq")
# install.packages(c("ggplot2", "vegan"))


library(phyloseq)
library(ggplot2)
library(vegan)

n_taxa <- 50
n_samples <- 12

taxa <- paste0("Taxon_bact", 1:n_taxa)

# Fake sample names
samples <- paste0("Sample_", 1:n_samples)

# Simulated count data (like sequencing reads)
counts <- matrix(
  rpois(n_taxa * n_samples, lambda = 100),
  nrow = n_taxa,
  ncol = n_samples,
  dimnames = list(taxa, samples)
)

head(counts)


metadata <- data.frame(
  Sample = samples,
  BodySite = rep(c("Gut", "Skin", "Oral"), each = 4),
  DiseaseStatus = rep(c("Healthy", "Disease"), times = 6),
  Treatment = rep(c("Control", "Drug"), times = 6)
)

metadata



# Sample names must match between the count table and the metadata,
# so set rownames on the metadata first
rownames(metadata) <- metadata$Sample

OTU  <- otu_table(as.matrix(counts), taxa_are_rows = TRUE)
META <- sample_data(metadata)

stopifnot(identical(sample_names(OTU), sample_names(META)))

ps <- phyloseq(OTU, META)

tax <- data.frame(
  Kingdom = rep("Bacteria", nrow(counts)),
  Phylum  = sample(c("Firmicutes", "Bacteroidetes", "Proteobacteria"),
                   nrow(counts), replace = TRUE),
  row.names = rownames(counts)
)

ps <- merge_phyloseq(ps, tax_table(as.matrix(tax)))

ps



sample_sums(ps)  # Total reads per sample
taxa_sums(ps)    # Total reads per taxon

#ps_rare <- rarefy_even_depth(ps, rngseed = 123)
#rarecurve(t(otu_table(ps_rare)), step = 50, cex = 0.5)

alpha_div <- estimate_richness(ps, measures = c("Observed", "Shannon", "Simpson"))

alpha_meta <- cbind(alpha_div, sample_data(ps))

library(ggpubr)
p1 <- ggplot(alpha_meta, aes(x = BodySite, y = Shannon, fill = BodySite)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  theme_bw() +
  labs(title = "Shannon Diversity by Body Site")

p2 <- ggplot(alpha_meta, aes(x = DiseaseStatus, y = Shannon, fill = DiseaseStatus)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  theme_bw() +
  labs(title = "Shannon Diversity by Disease Status")

ggarrange(p1, p2, ncol = 2)




# Taxonomic Composition

# Transform to relative abundance
ps_rel <- transform_sample_counts(ps, function(x) x / sum(x) * 100)

# Bar plot of the 5 most abundant taxa, coloured by Phylum
top_phyla <- names(sort(taxa_sums(ps_rel), decreasing = TRUE))[1:5]
ps_top <- prune_taxa(top_phyla, ps_rel)

plot_bar(ps_top, x = "Sample", fill = "Phylum") +
  facet_grid(~ BodySite, scales = "free_x", space = "free") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(y = "Relative Abundance (%)", 
       title = "Top 5 Phyla by Body Site")

# Stacked bar plot by group
plot_bar(ps_rel, x = "BodySite", fill = "Phylum") +
  geom_bar(stat = "identity", position = "fill") +
  theme_bw() +
  labs(y = "Proportion", title = "Phylum Composition by Body Site") +
  facet_wrap(~ DiseaseStatus)



ps_hel <- transform_sample_counts(ps, function(x) sqrt(x / sum(x)))

# Bray-Curtis 
ord_bray <- ordinate(ps_hel, method = "PCoA", distance = "bray")

p3 <- plot_ordination(ps_hel, ord_bray, color = "BodySite", shape = "DiseaseStatus") +
  geom_point(size = 4) +
  theme_bw() +
  stat_ellipse(aes(group = BodySite)) +
  labs(title = "PCoA - Bray-Curtis")

# NMDS plot
ord_nmds <- ordinate(ps_hel, method = "NMDS", distance = "bray")

p4 <- plot_ordination(ps_hel, ord_nmds, color = "BodySite", shape = "Treatment") +
  geom_point(size = 4) +
  theme_bw() +
  labs(title = "NMDS - Bray-Curtis")

library(patchwork)
p3 + p4



## clustering
ps_rel <- transform_sample_counts(ps, function(x) x / sum(x))
top_taxa <- names(sort(taxa_sums(ps_rel), decreasing = TRUE))[1:20]
ps_top <- prune_taxa(top_taxa, ps_rel)
mat <- as.matrix(otu_table(ps_top))
if (!taxa_are_rows(ps_top)) mat <- t(mat)
sample_annot <- data.frame(
  DiseaseStatus = sample_data(ps_top)$DiseaseStatus,
  BodySite      = sample_data(ps_top)$BodySite
)

rownames(sample_annot) <- sample_names(ps_top)
tax_annot <- data.frame(
  Phylum = tax_table(ps_top)[, "Phylum"]
)

rownames(tax_annot) <- taxa_names(ps_top)

library(pheatmap)

pheatmap(
  mat,
  scale = "row",                     # highlight patterns per taxon
  annotation_col = sample_annot,     # sample metadata
  annotation_row = tax_annot,        # taxonomic info
  show_rownames = TRUE,
  show_colnames = TRUE,
  fontsize_row = 8,
  fontsize_col = 9,
  main = "Heatmap of top taxa (relative abundance)"
)

Exercise

  • Set up a Linux environment on Windows using WSL

  • Create an isolated software environment using conda

  • Install cutadapt in the conda environment

  • Run a published 16S rRNA analysis using DADA2

  • Understand how ASVs are generated from raw sequencing data